Thoughts on Word and Sentence Segmentation in Thai

Author

  • Wirote Aroonmanakun
Abstract

This paper discusses problems of word and sentence segmentation in Thai. Disagreements over word segmentation arise mostly from compound words. To establish a standard resource and tool for word segmentation, we suggest that only simple words and true compound words be segmented during word segmentation. Other compounds can be grouped later by the same means used for multiword identification in other languages. Sentence segmentation is also difficult because sentence boundaries in Thai are fuzzy. We suggest that a discourse be seen as a combination of clauses rather than sentences. Discourse clues can then be used to segment these discourse units. The output of a sentence segmentation module could be a sequence of segments composed of clauses, which can then be assembled into the discourse structure.
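The two-stage idea in the abstract (segment only basic units first, group other compounds afterwards) can be illustrated with a minimal sketch. The abstract does not name an algorithm, so greedy longest matching is swapped in here purely for illustration. The function names `segment` and `group_multiwords`, the toy lexicon, and the multiword list are all hypothetical, and an unspaced English string stands in for Thai text.

```python
def segment(text, lexicon):
    """Stage 1: greedy longest-match segmentation against a lexicon of basic words.

    Characters that start no lexicon entry are emitted as single-character tokens.
    """
    max_len = max((len(w) for w in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])
            i += 1
    return tokens


def group_multiwords(tokens, multiwords):
    """Stage 2: merge adjacent tokens that form an entry in a multiword list."""
    max_n = max((len(m) for m in multiwords), default=0)
    grouped = []
    i = 0
    while i < len(tokens):
        # Prefer the longest multiword entry starting at position i.
        for n in range(min(max_n, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in multiwords:
                grouped.append("".join(tokens[i:i + n]))
                i += n
                break
        else:
            grouped.append(tokens[i])
            i += 1
    return grouped


# Toy data (assumed for illustration, not taken from the paper).
lexicon = {"the", "rain", "coat", "is", "wet"}
multiwords = {("rain", "coat")}

basic_units = segment("theraincoatiswet", lexicon)
print(basic_units)                                # ['the', 'rain', 'coat', 'is', 'wet']
print(group_multiwords(basic_units, multiwords))  # ['the', 'raincoat', 'is', 'wet']
```

Keeping the two stages separate means the first stage's output stays stable even when annotators disagree about which compounds count as single words; only the multiword list changes.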

Similar resources

Context Sensitive Pattern Based Segmentation: A Thai Challenge

A written Thai text is a string of symbols without explicit word boundary markup. A method for developing a segmentation tool from a corpus of already segmented text is described. The methodology is based on the technology of competing patterns, which evolved from an algorithm for English hyphenation. A new UNICODE pattern generation program, OPATGEN, is used for the learning phase. We have shown ...

A Lexicalized Tree Adjoining Grammar for Thai

This paper describes an alternative formalism for Thai syntax parsing based on a lexicalized tree adjoining grammar (LTAG). We first briefly present some formal background concerning LTAG, which is necessary for an understanding of LTAG and its application to Thai. Specifically, we address several issues regarding difficulties in parsing Thai sentences and how to resolve these issues using LTAG...

Thai News Text Summarization and Its Application

Since the Thai language lacks word/phrase/sentence boundaries, document summarization in Thai requires investigation into unit segmentation, unit selection, redundancy removal and evaluation dataset construction. In this work, we have proposed Thai Elementary Discourse Unit (TEDU) and a three-stage method of Thai multidocument summarization, i.e., unit segmentation, unit-graph formulation, and unit sel...

A Multi-Aspect Comparison and Evaluation on Thai Word Segmentation Programs

Word segmentation is an important task in natural language processing, especially for languages without word boundaries, such as Thai. Many Thai word segmentation programs have been developed. Researchers and developers working with Thai documents usually spend a tremendous amount of time studying and trying different Thai word segmentation programs. This paper presents the performance of six...

Panel: The State of the Art in Thai Language Processing

This paper reviews the current state of technology and research progress in Thai language processing. It summarizes the characteristics of the Thai language and the approaches to overcoming the difficulties in each processing task. 1 Some Problematic Issues in the Thai Processing It is obvious that the most fundamental semantic unit in a language is the word. Words are explicitly identified in t...

Publication date: 2007